Marks: 60
The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
conda install -c districtdatalabs yellowbrick
Collecting package metadata (current_repodata.json): ...working... done Solving environment: ...working... done # All requested packages already installed. Note: you may need to restart the kernel to use updated packages.
==> WARNING: A newer version of conda exists. <==
current version: 23.3.1
latest version: 23.5.0
Please update conda by running
$ conda update -n base -c defaults conda
Or to minimize the number of packages updated during conda update use
conda install conda=23.5.0
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
The initial steps to get an overview of any dataset is to:
#loading the dataset
data = pd.read_csv("C:/Users/Cooldream/Downloads/stock_data.csv")
#Shape of Dataframe
data.shape
(340, 15)
#Summary of Dataframe
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
Observations:
#Copy of the data
data1 = data.copy()
# lets convert object to categorical columns to reduce the space
data1.columns[data1.dtypes=='object'] # select columns which are object
Index(['Ticker Symbol', 'Security', 'GICS Sector', 'GICS Sub Industry'], dtype='object')
#lets convert object to categorical columns to reduce the space
for column in data1.columns[data1.dtypes=='object']:
data1[column] = data1[column].astype('category')
data1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null category 1 Security 340 non-null category 2 GICS Sector 340 non-null category 3 GICS Sub Industry 340 non-null category 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: category(4), float64(7), int64(4) memory usage: 58.1 KB
The data1 is converted to categorical variables and memory is increased to 58.1kb from 40kb, therefore will work with object variables only.
# let's view a sample of the data
data.sample(n=5, random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 | |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
#First 5 rows
data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
#Last 5 rows
data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# Statisitical summary of all the data
data.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net Cash Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327937 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings Per Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
Observations:
# Null Values in Dataframe
data.isna().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
# Duplicate Values in Dataframe
data.duplicated().sum()
print('Total number of duplicate values are:', data.duplicated().sum())
data.duplicated().value_counts()
Total number of duplicate values are: 0
False 340 dtype: int64
Observation:
There are no duplicate or null values in the dataset rows.
Ticker and Security are duplicate columns with exactly same information and one column can be dropped
#Dropping Security column from dataset
df=data.drop(['Security'], axis=1)
#checking frist 5 rows for df dataset
df.head(5)
| Ticker Symbol | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# list of all categorical columns
df.columns[df.dtypes=='object']
#objectdf = df.select_dtypes(include="object").columns.tolist()
Index(['Ticker Symbol', 'GICS Sector', 'GICS Sub Industry'], dtype='object')
objectdf = ['GICS Sector', 'GICS Sub Industry']
objectdf
['GICS Sector', 'GICS Sub Industry']
for column in objectdf:
print('Unique values of',column,':')
#print(df[column] .unique())
print(df[column].value_counts())
print('..................'*5)
Unique values of GICS Sector : Industrials 53 Financials 49 Health Care 40 Consumer Discretionary 40 Information Technology 33 Energy 30 Real Estate 27 Utilities 24 Materials 20 Consumer Staples 19 Telecommunications Services 5 Name: GICS Sector, dtype: int64 .......................................................................................... Unique values of GICS Sub Industry : Oil & Gas Exploration & Production 16 REITs 14 Industrial Conglomerates 14 Electric Utilities 12 Internet Software & Services 12 Health Care Equipment 11 MultiUtilities 11 Banks 10 Property & Casualty Insurance 8 Diversified Financial Services 7 Biotechnology 7 Pharmaceuticals 6 Packaged Foods & Meats 6 Oil & Gas Refining & Marketing & Transportation 6 Semiconductors 6 Diversified Chemicals 5 Consumer Finance 5 Integrated Oil & Gas 5 Industrial Machinery 5 Managed Health Care 5 Airlines 5 Health Care Facilities 5 Internet & Direct Marketing Retail 4 Residential REITs 4 Retail REITs 4 Soft Drinks 4 Research & Consulting Services 4 Asset Management & Custody Banks 4 Specialty Chemicals 4 Railroads 4 Aerospace & Defense 4 Integrated Telecommunications Services 4 Building Products 4 Hotels, Resorts & Cruise Lines 4 Restaurants 3 Cable & Satellite 3 Air Freight & Logistics 3 Household Products 3 Regional Banks 3 IT Consulting & Other Services 3 Construction & Farm Machinery & Heavy Trucks 3 Life & Health Insurance 3 Health Care Distributors 3 Specialized REITs 3 Oil & Gas Equipment & Services 3 Insurance Brokers 3 Specialty Stores 3 Fertilizers & Agricultural Chemicals 2 Tobacco 2 Leisure Products 2 Advertising 2 Data Processing & Outsourced Services 2 Paper Packaging 2 Construction Materials 2 Application Software 2 Homebuilding 2 Automobile Manufacturers 2 Investment Banking & Brokerage 2 Broadcasting & Cable TV 2 Auto Parts & Equipment 2 Health Care Supplies 2 Electronic Components 2 Gold 1 Environmental Services 1 Home Furnishings 1 Household Appliances 1 Apparel, Accessories & Luxury Goods 1 Specialty Retail 1 Life Sciences Tools & Services 1 Publishing 1 Human Resource & Employment Services 1 Steel 1 Housewares & Specialties 1 Thrifts & Mortgage Finance 1 Brewers 1 Technology, Hardware, Software and Supplies 1 Personal Products 1 Industrial Gases 1 Office REITs 1 Multi-Sector Holdings 1 Alternative Carriers 1 Computer Hardware 1 Distributors 1 Agricultural Products 1 Metal & Glass Containers 1 Financial Exchanges & Data 1 Water Utilities 1 Home Entertainment Software 1 Drug Retail 1 Electrical Components & Equipment 1 Semiconductor Equipment 1 Multi-line Insurance 1 Copper 1 Electronic Equipment & Instruments 1 Diversified Commercial Services 1 Consumer Electronics 1 Tires & Rubber 1 Industrial Materials 1 Motorcycle Manufacturers 1 Technology Hardware, Storage & Peripherals 1 Real Estate Services 1 Trucking 1 Networking Equipment 1 Casinos & Gaming 1 Name: GICS Sub Industry, dtype: int64 ..........................................................................................
Observations:
Questions:
#plot count
def hist_box(data, column):
#subplots of histogram and boxplot
fig, (ax_hist, ax_box) = plt.subplots(2 , figsize = (13, 6), gridspec_kw={"height_ratios": (0.7, 0.3)})
#plotting the title of graphs
fig.suptitle(column)
# plotting histogram
sns.histplot(data=data, x = column, ax=ax_hist, kde = True ) #plotting on first axis
#plotting mean line in histogram
Mean=data[column].mean()
ax_hist.axvline(x = Mean, color ='orange', linestyle="--") #plotting mean line on first axis
#plotting Boxplot
sns.boxplot(data = data, x=column, ax=ax_box, showfliers=True, color = 'yellow') #plotting on 2nd axis
plt.show()
def bar_plot(data, column):
plt.figure(figsize = (5,3))
plt.title( column , loc = 'center', fontsize = 13) # Plot title on top of plot
# plot the countplot for distribution
sns.countplot(data = df, x = column, palette = 'Set2')
plt.xticks(rotation=90)
plt.show()
print('************' *10)
print(data[column].value_counts().head(6))
Current Price
#plot for Current Price
hist_box(df, 'Current Price')
df160 = df[df['Current Price'] < 160]
df160.shape
(317, 14)
hist_box(df160,'Current Price')
Observation:
Price Change
#plot for Price Change
hist_box(df, 'Price Change')
Observation:
Volatility
#plot for Volatility
hist_box(df, 'Volatility')
Observation:
ROE
# Plot for ROE
hist_box(df, 'ROE')
df50 = df[df['ROE'] < 50]
df50.shape
(298, 14)
hist_box(df50, 'ROE')
Observation:
Cash Ratio
#plot for Cash Ratio
hist_box(df, 'Cash Ratio')
Observation:
Net Cash Flow
#plot for Net Cash Flow
hist_box(df, 'Net Cash Flow')
Observation:
Net Income
#plot for Net Income
hist_box(df, 'Net Income')
dfnet0 = df[df['Net Income'] < 0]
dfnet0.shape
(32, 14)
Observation:
Earnings Per Share
#plot for Net Income
hist_box(df, 'Earnings Per Share')
dfEPS0 = df[df['Earnings Per Share'] < 0]
dfEPS0.shape
(34, 14)
Observations:
Estimated Shares Outstanding
#plot for Estimated Shares Outstanding
hist_box(df, 'Estimated Shares Outstanding')
Observation:
P/E Ratio and P/B Ratio
P/E Ratio
#plot for P/E Ratio
hist_box(df, 'P/E Ratio')
df90PE = df[(df['P/E Ratio'] > 80) & (df['P/E Ratio'] < 100)]
df90PE.shape
(20, 14)
dfhighPE = df[df['P/E Ratio'] > 200]
dfhighPE
| Ticker Symbol | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | ALXN | Health Care | Biotechnology | 190.750000 | 22.338380 | 2.022921 | 2 | 195 | 66000000 | 144000000 | 0.68 | 211764705.9 | 280.514706 | -14.171389 |
| 26 | AMZN | Consumer Discretionary | Internet & Direct Marketing Retail | 675.890015 | 32.268105 | 1.460386 | 4 | 58 | 1333000000 | 596000000 | 1.28 | 465625000.0 | 528.039074 | 3.904430 |
| 226 | NFLX | Information Technology | Internet Software & Services | 114.379997 | 11.145654 | 2.605949 | 6 | 65 | 695722000 | 122641000 | 0.29 | 422900000.0 | 394.413783 | -5.700168 |
Observations
P/B Ratio
#plot for P/B Ratio
hist_box(df, 'P/B Ratio')
Observations:
GICS Sub Industry
#Countplot for categorical value: 'GICS Sub Industry'
plt.figure(figsize = (20,4))
plt.title( 'GICS Sub Industry', loc = 'center', fontsize = 13) # Plot title on top of plot
# plot the countplot for distribution
sns.countplot(data = df, x = 'GICS Sub Industry', palette = 'Paired')
plt.xticks(rotation=90);
print(data['GICS Sub Industry'].value_counts().head(6))
Oil & Gas Exploration & Production 16 REITs 14 Industrial Conglomerates 14 Electric Utilities 12 Internet Software & Services 12 Health Care Equipment 11 Name: GICS Sub Industry, dtype: int64
Observations
GICS Sector
bar_plot(df, 'GICS Sector')
************************************************************************************************************************ Industrials 53 Financials 49 Health Care 40 Consumer Discretionary 40 Information Technology 33 Energy 30 Name: GICS Sector, dtype: int64
Observations:
def boxplot(data, column, target):
plt.figure(figsize=(15, 7))
plt.title( "column" , loc = 'center', fontsize = 13)
sns.boxplot(data = df, x=column, y=target, palette = 'bright')
plt.xticks(rotation=60)
plt.show()
plt.figure(figsize=(15, 7))
sns.barplot(data = df, x=column, y=target, errorbar=('ci', False), palette = 'Set2')
plt.xticks(rotation=60)
plt.show()
How are the different variables correlated with each other?
# correlation check
plt.figure(figsize=(15, 7))
sns.heatmap(
df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
Observations:
#lets plot numerical columns
sns.pairplot(data=df, corner = 'True')
<seaborn.axisgrid.PairGrid at 0x2261a8c29e0>
Observations:
Let's check the stocks of which economic sector have seen the maximum price increase on average.
GICS Sector vs Price Change
boxplot(df, 'GICS Sector', 'Price Change')
Observations:
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. Let's see how the average cash ratio varies across economic sectors.
GICS Sector & Cash Ratio
boxplot(df, 'GICS Sector', 'Cash Ratio')
Observations
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. Let's see how the P/E ratio varies, on average, across economic sectors.
GICS Sector & P/E ratios
boxplot(df, 'GICS Sector', "P/E Ratio" )
Observations:
Inference:
Higher P/E ratio signifies that investors are willing to pay 72 times higher price for security for similar EPS.
Volatility accounts for the fluctuation in the stock price. A stock with high volatility will witness sharper price changes, making it a riskier investment. Let's see how volatility varies, on average, across economic sectors.
GICS Sector & Volatility
boxplot(df, 'GICS Sector', "Volatility" )
Observations:
Inference: Usually consumer market sectors have low volatility like consumer staples, utilities and real estates.
Dataset has no missing data or duplicated values
plt.figure(figsize=(15, 12))
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Observation:
#Create list of all numerical columns for scaling
num_col = df.select_dtypes(exclude="object").columns.tolist()
num_col
['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio', 'Net Cash Flow', 'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio']
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
subset_scaled
array([[-0.39334123, 0.49394985, 0.2727489 , ..., 0.10786299,
-0.65248711, -0.50665346],
[-0.2208372 , 0.35543891, 1.13704483, ..., 1.25027416,
-0.31176905, -0.50420469],
[-0.36719495, 0.6024795 , -0.42700717, ..., 1.09802058,
-0.39150173, 0.09494078],
...,
[ 0.22191276, 0.43953914, -0.20606693, ..., -0.46005754,
2.23363447, -1.58938994],
[-0.54705258, -0.43681096, -0.09781298, ..., -0.37785227,
-0.22271438, 0.11867997],
[-0.3364527 , 1.05104602, 0.14267127, ..., -0.09294161,
0.85490199, 0.24675384]])
type(subset_scaled)
numpy.ndarray
# creating a dataframe of the numpy array of scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
subset_scaled_df.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
sns.heatmap(subset_scaled_df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
<Axes: >
#lets plot scaled numerical columns
#sns.pairplot(data=subset_scaled_df, corner = 'True')
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(subset_scaled_df)
prediction = model.predict(subset_scaled_df)
distortion = (
sum(
np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
)
/ subset_scaled_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.273154903795774 Number of Clusters: 4 Average Distortion: 2.1745559827866363 Number of Clusters: 5 Average Distortion: 2.142986303882605 Number of Clusters: 6 Average Distortion: 2.056310851689408 Number of Clusters: 7 Average Distortion: 1.9790306256336803 Number of Clusters: 8 Average Distortion: 1.962985208935381 Number of Clusters: 9 Average Distortion: 1.9322508609646234 Number of Clusters: 10 Average Distortion: 1.864272315337918 Number of Clusters: 11 Average Distortion: 1.7936374941086612 Number of Clusters: 12 Average Distortion: 1.7490589066183273 Number of Clusters: 13 Average Distortion: 1.7120946526296594 Number of Clusters: 14 Average Distortion: 1.69011308171502
Text(0.5, 1.0, 'Selecting k with the Elbow Method')
**The appropriate value of k from the elbow curve seems to be 5,7,and 9.**
#fit KMeans model and use visualizaer to indicate optimal K value
k_means_df = subset_scaled_df.copy()
Kmodel = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((subset_scaled_df))
score = silhouette_score(k_means_df, preds)
sil_score.append(score)
print("For n_clusters = {}, the silhouette score is {})".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457) For n_clusters = 3, the silhouette score is 0.4644405674779404) For n_clusters = 4, the silhouette score is 0.4577225970476733) For n_clusters = 5, the silhouette score is 0.43228336443659804) For n_clusters = 6, the silhouette score is 0.4005422737213617) For n_clusters = 7, the silhouette score is 0.3976335364987305) For n_clusters = 8, the silhouette score is 0.40278401969450467) For n_clusters = 9, the silhouette score is 0.3778585981433699) For n_clusters = 10, the silhouette score is 0.13458938329968687) For n_clusters = 11, the silhouette score is 0.1421832155528444) For n_clusters = 12, the silhouette score is 0.2044669621527429) For n_clusters = 13, the silhouette score is 0.23424874810104204) For n_clusters = 14, the silhouette score is 0.12102526472829901)
#fit KMeans model and use visualizaer to indicate optimal K value
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show() # finalize and render figure
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
From the silhouette scores, 3, 4 and 5 seems to be good value for k.
# finding optimal no. of clusters with silhouette coefficients
visualizer9 = SilhouetteVisualizer(KMeans(9, random_state=1))
visualizer9.fit(subset_scaled_df)
visualizer9.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer7 = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer7.fit(subset_scaled_df)
visualizer7.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer6 = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer6.fit(subset_scaled_df)
visualizer6.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer5 = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer5.fit(subset_scaled_df)
visualizer5.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer4 = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer4.fit(subset_scaled_df)
visualizer4.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer3 = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer3.fit(subset_scaled_df)
visualizer3.show();
Since all the SilhouetteVisualizer plot for K= 3,4,5,6,7,9 have negative values and major chunk of data lies in one particular cluster, will use K=5 for final model
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(subset_scaled_df)
KMeans(n_clusters=5, random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=5, random_state=0)
# creating a copy of the original data
df1 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
#k_means_df["KM_segments"] = kmeans.labels_
df["K_means_segments"] = kmeans.labels_
df.head()
| Ticker Symbol | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | K_means_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 4 |
| 1 | ABBV | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 4 |
| 2 | ABT | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 4 |
| 3 | ADBE | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 4 |
| 4 | ADI | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 0 |
df["K_means_segments"].value_counts()
4 259 1 35 0 24 2 13 3 9 Name: K_means_segments, dtype: int64
km_cluster_profile = df.groupby("K_means_segments").mean()
km_cluster_profile["count_in_each_segment"] = (
df.groupby("K_means_segments")["Current Price"].count().values
)
# let's display cluster profiles
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments | ||||||||||||
| 0 | 246.574304 | 14.284326 | 1.769621 | 26.500000 | 279.916667 | 459120250.000000 | 1009205541.666667 | 6.167917 | 549432140.538333 | 90.097512 | 14.081386 | 24 |
| 1 | 41.373681 | -14.849938 | 2.596790 | 27.285714 | 64.457143 | 34462657.142857 | -1293864285.714286 | -2.459714 | 450100420.905143 | 61.563930 | 2.476202 | 35 |
| 2 | 48.103077 | 6.053507 | 1.163964 | 27.538462 | 77.230769 | 773230769.230769 | 14114923076.923077 | 3.958462 | 3918734987.169230 | 16.098039 | -4.253404 | 13 |
| 3 | 72.783335 | 0.912232 | 2.015435 | 542.666667 | 34.000000 | -350866666.666667 | -5843677777.777778 | -14.735556 | 372500020.988889 | 53.574485 | -8.831054 | 9 |
| 4 | 72.768128 | 5.701175 | 1.359857 | 25.598456 | 52.216216 | -913081.081081 | 1537660934.362934 | 3.719247 | 436114647.527683 | 23.473934 | -3.374716 | 259 |
for cl in df["K_means_segments"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df[df["K_means_segments"] == cl]["Ticker Symbol"].unique())
print()
In cluster 4, the following companies are present: ['AAL' 'ABBV' 'ABT' 'ADBE' 'ADM' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'ALB' 'ALK' 'ALL' 'AMAT' 'AME' 'AMG' 'AMP' 'AMT' 'AN' 'ANTM' 'AON' 'APH' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAX' 'BBT' 'BCR' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CFG' 'CHD' 'CHRW' 'CI' 'CINF' 'CMA' 'CME' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EQR' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EXC' 'EXPD' 'EXPE' 'EXR' 'FAST' 'FBHS' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'GD' 'GGP' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HIG' 'HOG' 'HON' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'IP' 'IPG' 'IRM' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'KIM' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MMC' 'MMM' 'MO' 'MPC' 'MTB' 'MTD' 'MYL' 'NAVI' 'NDAQ' 'NEE' 'NLSN' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OMC' 'ORLY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PX' 'PYPL' 'RCL' 'RHI' 'ROP' 'RSG' 'SCG' 'SCHW' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SRCL' 'SRE' 'STI' 'STT' 'SYF' 'SYK' 'TAP' 'TGNA' 'TMK' 'TMO' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VTR' 'WEC' 'WHR' 'WM' 'WU' 'WY' 'WYN' 'XEL' 'XL' 'XRAY' 'XRX' 'XYL' 'YUM' 'ZBH' 'ZION' 'ZTS'] In cluster 0, the following companies are present: ['ADI' 'ADS' 'ALXN' 'AMGN' 'AMZN' 'BIIB' 'CELG' 'CMG' 'EQIX' 'EW' 'FB' 'FSLR' 'FTR' 'ISRG' 'MCD' 'MNST' 'NFLX' 'PCLN' 'REGN' 'TRIP' 'VRTX' 'WAT' 'WYNN' 'YHOO'] In cluster 1, the following companies are present: ['AKAM' 'APC' 'ARNC' 'BHI' 'CF' 'COG' 'CXO' 'EOG' 'EQT' 'FCX' 'HAL' 'HES' 'HPE' 'KMI' 'KSU' 'LUK' 'MLM' 'MOS' 'MRO' 'MUR' 'NBL' 'NEM' 'NOV' 'OKE' 'OXY' 'PWR' 'R' 'RRC' 'SE' 'SWKS' 'SWN' 'TDC' 'UAA' 'WMB' 'XEC'] In cluster 3, the following companies are present: ['ALLE' 'APA' 'CHK' 'CHTR' 'CL' 'DVN' 'KMB' 'NFX' 'SPGI'] In cluster 2, the following companies are present: ['BAC' 'C' 'F' 'GILD' 'INTC' 'JPM' 'KO' 'MRK' 'PFE' 'T' 'VZ' 'WFC' 'XOM']
df.groupby(["K_means_segments", "GICS Sector"])["Ticker Symbol"].count()
K_means_segments GICS Sector
0 Consumer Discretionary 6
Consumer Staples 1
Health Care 9
Information Technology 6
Real Estate 1
Telecommunications Services 1
1 Consumer Discretionary 1
Energy 20
Financials 1
Industrials 4
Information Technology 4
Materials 5
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 4
Health Care 3
Information Technology 1
Telecommunications Services 2
3 Consumer Discretionary 1
Consumer Staples 2
Energy 4
Financials 1
Industrials 1
4 Consumer Discretionary 31
Consumer Staples 15
Energy 5
Financials 43
Health Care 28
Industrials 48
Information Technology 22
Materials 15
Real Estate 26
Telecommunications Services 2
Utilities 24
Name: Ticker Symbol, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df, x="K_means_segments", y=variable, palette = 'Paired')
plt.tight_layout(pad=2.0)
Cluster 0
['ADI' 'ADS' 'ALXN' 'AMGN' 'AMZN' 'BIIB' 'CELG' 'CMG' 'EQIX' 'EW' 'FB' 'FSLR' 'FTR' 'ISRG' 'MCD' 'MNST' 'NFLX' 'PCLN' 'REGN' 'TRIP' 'VRTX' 'WAT' 'WYNN' 'YHOO']
Cluster 1
['AKAM' 'APC' 'ARNC' 'BHI' 'CF' 'COG' 'CXO' 'EOG' 'EQT' 'FCX' 'HAL' 'HES' 'HPE' 'KMI' 'KSU' 'LUK' 'MLM' 'MOS' 'MRO' 'MUR' 'NBL' 'NEM' 'NOV' 'OKE' 'OXY' 'PWR' 'R' 'RRC' 'SE' 'SWKS' 'SWN' 'TDC' 'UAA' 'WMB' 'XEC']
Cluster 2
In cluster 2, the following companies are present: ['BAC' 'C' 'F' 'GILD' 'INTC' 'JPM' 'KO' 'MRK' 'PFE' 'T' 'VZ' 'WFC' 'XOM']
Cluster 3:
['ALLE' 'APA' 'CHK' 'CHTR' 'CL' 'DVN' 'KMB' 'NFX' 'SPGI']
Cluster 4
['AAL' 'ABBV' 'ABT' 'ADBE' 'ADM' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'ALB' 'ALK' 'ALL' 'AMAT' 'AME' 'AMG' 'AMP' 'AMT' 'AN' 'ANTM' 'AON' 'APH' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAX' 'BBT' 'BCR' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CFG' 'CHD' 'CHRW' 'CI' 'CINF' 'CMA' 'CME' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EQR' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EXC' 'EXPD' 'EXPE' 'EXR' 'FAST' 'FBHS' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'GD' 'GGP' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HIG' 'HOG' 'HON' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'IP' 'IPG' 'IRM' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'KIM' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MMC' 'MMM' 'MO' 'MPC' 'MTB' 'MTD' 'MYL' 'NAVI' 'NDAQ' 'NEE' 'NLSN' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OMC' 'ORLY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PX' 'PYPL' 'RCL' 'RHI' 'ROP' 'RSG' 'SCG' 'SCHW' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SRCL' 'SRE' 'STI' 'STT' 'SYF' 'SYK' 'TAP' 'TGNA' 'TMK' 'TMO' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VTR' 'WEC' 'WHR' 'WM' 'WU' 'WY' 'WYN' 'XEL' 'XL' 'XRAY' 'XRX' 'XYL' 'YUM' 'ZBH' 'ZION' 'ZTS']
hc_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(subset_scaled_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(subset_scaled_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.9259195530524591. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281. **************************************************************************************************** Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
Let's explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hc_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Cophenetic correlation for single linkage is 0.9232271494002922. Cophenetic correlation for complete linkage is 0.7873280186580672. Cophenetic correlation for average linkage is 0.9422540609560814. Cophenetic correlation for centroid linkage is 0.9314012446828154. Cophenetic correlation for ward linkage is 0.7101180299865353. Cophenetic correlation for weighted linkage is 0.8693784298129404. **************************************************************************************************** Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
Let's view the dendrograms for the different linkage methods with Euclidean distance.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(subset_scaled_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(subset_scaled_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
**Observations**
1. The cophenetic correlation is highest for average and centroid linkage methods.
2. We will move ahead with average linkage as its slightly higher cophentic correlation than centroid linkage method
3. 5 appears to be the appropriate number of clusters from the dendrogram for average linkage.
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
| Linkage | Cophenetic Coefficient |
|---|
HCmodel = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="average")
HCmodel.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(affinity='euclidean', linkage='average', n_clusters=5)
# creating a copy of the original data
df2 = df1.drop(['K_means_segments'], axis=1)
subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df2["HC_Clusters"] = HCmodel.labels_
df2.head()
| Ticker Symbol | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 |
| 1 | ABBV | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 |
| 2 | ABT | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 0 |
| 3 | ADBE | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 0 |
| 4 | ADI | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 0 |
hc_cluster_profile = df2.groupby("HC_Clusters").mean()
hc_cluster_profile["count_in_each_segment"] = (
df2.groupby("HC_Clusters")["Current Price"].count().values
)
# let's display cluster profiles
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||
| 0 | 76.979246 | 4.142346 | 1.516898 | 35.200000 | 67.280597 | 67033707.462687 | 1610408549.253731 | 2.887716 | 573826376.546119 | 32.329515 | -2.152930 | 335 |
| 1 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 2 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 3 | 104.660004 | 16.224320 | 1.320606 | 8.000000 | 958.000000 | 592000000.000000 | 3669000000.000000 | 1.310000 | 2800763359.000000 | 79.893133 | 5.884467 | 1 |
| 4 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 90885000.000000 | 596541000.000000 | 8.910000 | 66951851.850000 | 31.040405 | 129.064585 | 1 |
There is one big cluster of 335 companies, 3 clusters of one company, 1 clusters of two companies. As this clustering has not achieved anything lets try Ward linkeage methods as its Dendogram it looks widely distributed.
HCmodel_ward = AgglomerativeClustering(n_clusters=5, affinity="euclidean", linkage="ward")
HCmodel_ward.fit(subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(affinity='euclidean', n_clusters=5)
# creating a copy of the original data
df3 = df1.drop(['K_means_segments'], axis=1)
subset_scaled_df["HC_Clusters"] = HCmodel_ward.labels_
df3["HC_Clusters"] = HCmodel_ward.labels_
df3.head()
| Ticker Symbol | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 |
| 1 | ABBV | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 |
| 2 | ABT | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 0 |
| 3 | ADBE | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 0 |
| 4 | ADI | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 0 |
hc_cluster_ward = df3.groupby("HC_Clusters").mean()
hc_cluster_ward["count_in_each_segment"] = (
df3.groupby("HC_Clusters")["Current Price"].count().values
)
# let's display cluster profiles
hc_cluster_ward.style.highlight_max(color="lightgreen", axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | ||||||||||||
| 0 | 72.618250 | 5.388970 | 1.429404 | 25.279863 | 70.788396 | 199298894.197952 | 1613911556.313993 | 3.629027 | 476295193.172730 | 24.789830 | -2.322255 | 293 |
| 1 | 489.313326 | 9.749852 | 1.687223 | 14.555556 | 152.666667 | 248373888.888889 | 655419333.333333 | 11.315556 | 160803192.135556 | 167.363342 | 24.013287 | 9 |
| 2 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3040666666.666667 | 14848444444.444445 | 3.435556 | 4564959946.222222 | 15.596051 | -6.354193 | 9 |
| 3 | 84.355716 | 3.854981 | 1.827670 | 633.571429 | 33.571429 | -568400000.000000 | -4968157142.857142 | -10.841429 | 398169036.442857 | 42.284541 | -11.589502 | 7 |
| 4 | 36.440455 | -16.073408 | 2.832884 | 57.500000 | 42.409091 | -472834090.909091 | -3161045227.272727 | -8.005000 | 514367806.201818 | 85.555682 | 0.836839 | 22 |
This is quite similar to K-means clustering with 5 clusters and will continue with this one.
for cl in df3["HC_Clusters"].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(df3[df3["HC_Clusters"] == cl]["Ticker Symbol"].unique())
print()
In cluster 0, the following companies are present: ['AAL' 'ABBV' 'ABT' 'ADBE' 'ADI' 'ADM' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'AKAM' 'ALB' 'ALK' 'ALL' 'AMAT' 'AME' 'AMG' 'AMGN' 'AMP' 'AMT' 'AN' 'ANTM' 'AON' 'APH' 'ARNC' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAC' 'BAX' 'BBT' 'BCR' 'BIIB' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CELG' 'CF' 'CFG' 'CHD' 'CHRW' 'CI' 'CINF' 'CMA' 'CME' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EQR' 'EQT' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EW' 'EXC' 'EXPD' 'EXPE' 'EXR' 'FAST' 'FB' 'FBHS' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'FSLR' 'FTR' 'GD' 'GGP' 'GILD' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAL' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HIG' 'HOG' 'HON' 'HPE' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'INTC' 'IP' 'IPG' 'IRM' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'KIM' 'KSU' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUK' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCD' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MLM' 'MMC' 'MMM' 'MNST' 'MO' 'MOS' 'MPC' 'MRK' 'MTB' 'MTD' 'MYL' 'NAVI' 'NDAQ' 'NEE' 'NEM' 'NLSN' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OMC' 'ORLY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PWR' 'PX' 'PYPL' 'R' 'RCL' 'RHI' 'ROP' 'RSG' 'SCG' 'SCHW' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SRCL' 'SRE' 'STI' 'STT' 'SWKS' 'SYF' 'SYK' 'TAP' 'TGNA' 'TMK' 'TMO' 'TRIP' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAA' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VRTX' 'VTR' 'WAT' 'WEC' 'WHR' 'WM' 'WU' 'WY' 'WYN' 'WYNN' 'XEL' 'XL' 'XRAY' 'XRX' 'XYL' 'YHOO' 'YUM' 'ZBH' 'ZION' 'ZTS'] In cluster 1, the following companies are present: ['ADS' 'ALXN' 'AMZN' 'CMG' 'EQIX' 'ISRG' 'NFLX' 'PCLN' 'REGN'] In cluster 3, the following companies are present: ['ALLE' 'APA' 'CHK' 'CHTR' 'CL' 'KMB' 'SPGI'] In cluster 4, the following companies are present: ['APC' 'BHI' 'COG' 'CXO' 'DVN' 'EOG' 'FCX' 'HES' 'KMI' 'MRO' 'MUR' 'NBL' 'NFX' 'NOV' 'OKE' 'OXY' 'RRC' 'SE' 'SWN' 'TDC' 'WMB' 'XEC'] In cluster 2, the following companies are present: ['C' 'F' 'JPM' 'KO' 'PFE' 'T' 'VZ' 'WFC' 'XOM']
df3.groupby(["HC_Clusters", "GICS Sector"])["Ticker Symbol"].count()
HC_Clusters GICS Sector
0 Consumer Discretionary 35
Consumer Staples 16
Energy 7
Financials 45
Health Care 36
Industrials 52
Information Technology 30
Materials 19
Real Estate 26
Telecommunications Services 3
Utilities 24
1 Consumer Discretionary 3
Health Care 3
Information Technology 2
Real Estate 1
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Telecommunications Services 2
3 Consumer Discretionary 1
Consumer Staples 2
Energy 2
Financials 1
Industrials 1
4 Energy 20
Information Technology 1
Materials 1
Name: Ticker Symbol, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data=df2, x="HC_Clusters", y=variable)
plt.tight_layout(pad=2.0)
Cluster 0
['AAL' 'ABBV' 'ABT' 'ADBE' 'ADI' 'ADM' 'AEE' 'AEP' 'AFL' 'AIG' 'AIV' 'AIZ' 'AJG' 'AKAM' 'ALB' 'ALK' 'ALL' 'AMAT' 'AME' 'AMG' 'AMGN' 'AMP' 'AMT' 'AN' 'ANTM' 'AON' 'APH' 'ARNC' 'ATVI' 'AVB' 'AVGO' 'AWK' 'AXP' 'BA' 'BAC' 'BAX' 'BBT' 'BCR' 'BIIB' 'BK' 'BLL' 'BMY' 'BSX' 'BWA' 'BXP' 'CAT' 'CB' 'CBG' 'CCI' 'CCL' 'CELG' 'CF' 'CFG' 'CHD' 'CHRW' 'CI' 'CINF' 'CMA' 'CME' 'CMI' 'CMS' 'CNC' 'CNP' 'COF' 'COO' 'CSX' 'CTL' 'CTSH' 'CTXS' 'CVS' 'CVX' 'D' 'DAL' 'DD' 'DE' 'DFS' 'DGX' 'DHR' 'DIS' 'DISCA' 'DISCK' 'DLPH' 'DLR' 'DNB' 'DOV' 'DPS' 'DUK' 'DVA' 'EBAY' 'ECL' 'ED' 'EFX' 'EIX' 'EMN' 'EQR' 'EQT' 'ES' 'ESS' 'ETFC' 'ETN' 'ETR' 'EW' 'EXC' 'EXPD' 'EXPE' 'EXR' 'FAST' 'FB' 'FBHS' 'FE' 'FIS' 'FISV' 'FLIR' 'FLR' 'FLS' 'FMC' 'FRT' 'FSLR' 'FTR' 'GD' 'GGP' 'GILD' 'GLW' 'GM' 'GPC' 'GRMN' 'GT' 'GWW' 'HAL' 'HAS' 'HBAN' 'HCA' 'HCN' 'HCP' 'HIG' 'HOG' 'HON' 'HPE' 'HPQ' 'HRL' 'HSIC' 'HST' 'HSY' 'HUM' 'IBM' 'IDXX' 'IFF' 'INTC' 'IP' 'IPG' 'IRM' 'ITW' 'IVZ' 'JBHT' 'JEC' 'JNPR' 'KIM' 'KSU' 'LEG' 'LEN' 'LH' 'LKQ' 'LLL' 'LLY' 'LMT' 'LNT' 'LUK' 'LUV' 'LVLT' 'LYB' 'MA' 'MAA' 'MAC' 'MAR' 'MAS' 'MAT' 'MCD' 'MCO' 'MDLZ' 'MET' 'MHK' 'MJN' 'MKC' 'MLM' 'MMC' 'MMM' 'MNST' 'MO' 'MOS' 'MPC' 'MRK' 'MTB' 'MTD' 'MYL' 'NAVI' 'NDAQ' 'NEE' 'NEM' 'NLSN' 'NSC' 'NTRS' 'NUE' 'NWL' 'O' 'OMC' 'ORLY' 'PBCT' 'PBI' 'PCAR' 'PCG' 'PEG' 'PEP' 'PFG' 'PG' 'PGR' 'PHM' 'PM' 'PNC' 'PNR' 'PNW' 'PPG' 'PPL' 'PRU' 'PSX' 'PWR' 'PX' 'PYPL' 'R' 'RCL' 'RHI' 'ROP' 'RSG' 'SCG' 'SCHW' 'SEE' 'SHW' 'SLG' 'SNI' 'SO' 'SPG' 'SRCL' 'SRE' 'STI' 'STT' 'SWKS' 'SYF' 'SYK' 'TAP' 'TGNA' 'TMK' 'TMO' 'TRIP' 'TRV' 'TSCO' 'TSN' 'TSO' 'TSS' 'TXN' 'UAA' 'UAL' 'UDR' 'UHS' 'UNH' 'UNM' 'UNP' 'UPS' 'UTX' 'VAR' 'VLO' 'VMC' 'VNO' 'VRSK' 'VRSN' 'VRTX' 'VTR' 'WAT' 'WEC' 'WHR' 'WM' 'WU' 'WY' 'WYN' 'WYNN' 'XEL' 'XL' 'XRAY' 'XRX' 'XYL' 'YHOO' 'YUM' 'ZBH' 'ZION' 'ZTS']
Cluster 1
['ADS' 'ALXN' 'AMZN' 'CMG' 'EQIX' 'ISRG' 'NFLX' 'PCLN' 'REGN']
CLusters 2
['C' 'F' 'JPM' 'KO' 'PFE' 'T' 'VZ' 'WFC' 'XOM']
Clusters 3
1.In cluster 3, the following companies are present:
['ALLE' 'APA' 'CHK' 'CHTR' 'CL' 'KMB' 'SPGI']
Clusters 4:
['APC' 'BHI' 'COG' 'CXO' 'DVN' 'EOG' 'FCX' 'HES' 'KMI' 'MRO' 'MUR' 'NBL' 'NFX' 'NOV' 'OKE' 'OXY' 'RRC' 'SE' 'SWN' 'TDC' 'WMB' 'XEC']
You compare several things, like:
You can also mention any differences or similarities you obtained in the cluster profiles from both the clustering techniques.
Which clustering technique took less time for execution?
Both K-Means model and the Agglomerative Clustering model took just few seconds to create the clusters.
Which clustering technique gave you more distinct clusters, or are they the same?
We used 5 clusters for both models. Both models have one big clusters with above 250 companies but K mean model has distict distribution of stocks within 4 models and able to clusters stocks on distinct properties better than Agglomerative clustering.
How many observations are there in the similar clusters of both algorithms?
As there is one big cluster of 250 above stocks in both models, there are quite many companies in big cluster. K-mean Cluster 3rd have very high ROE like Cluster 2 of Agglomerative model.
How many clusters are obtained as the appropriate number of clusters from both algorithms?
For K-mean 5 clusters are selected from high silhouette scores and elbow method. For Agglomerative clustering 5 clusters are chosen on based on dendogram.
Differences or similarities you obtained in the cluster profiles from both the clustering techniques.
Altough both K-model and Agglomerative clustering have 5 clusters, but there 4 smaller cluster are quite different on properties. Agglomerative clustering actually clustered stocks based on one prominent parameter on which one cluster is quite distinct (very far away ) from other. K-mean clustering created clusters based on combination of distict properties which separated one cluster from other with some overlapping. But K mean clustering created clusters more distinctly for investing purpose.
As K-mean Clustering did able to clusters in terms of investing opportunities, Trade&Ahead can use these clusters to create mutual funds as starting point for investing and avoid stocks which are not doing well in 13 week times.
~ Cluster 0 : High Cash ratio ; P/E ratio is high so slightly expansive; High current price; low to medium volatility. This clusters have good stocks but might be little overprised right now.
~ Cluster 1 : Negative price change ; High volatile; Negative EPS and income. Despite negative EPS, few stocks have P/E ratio >100, indicating investors are willing to pay higher price for these stock
~ Cluster 2 : High estimated outstanding shares; Negative net cash flow for 50% companies; Net positive net income; Low volatile stocks; Low P/E ratio. With net postitive income despite negative cash flow, these stocks are at bargain price for such low P/E ratio.
~ Cluster 3 : Very negative outlook; Negative EPS; Very High ROS; High Volatile stocks; Negative Net income This cluster constitute most poor performing stocks with very bleak outlook.
~ Cluster 4: Low P/E ratio; Net positive income; Wide Price change; Quite Volatile This cluster contain 259 stocks and furthur clustering could be done to separate stocks on volatility.
Depending upon financial goals and risk tolerance behaviour of clients, these clusters could be added to their portfolio.
Trade&Ahead can use these clusters and can create mutual funds or ETFs (exchange trading funds )based on them to be used as combination investment tool.
Based on clustering, and overall performance, worst perfroming stocks or clusters can slowly be removed from all portfolios.